The project started on 19 January 2017. Updates will be available here: https://github.com/tlehtiniemi/IODS-project
Let’s first read data from the local file into a data frame
learning2014 <- read.table("/Users/lehtint9/IODS-project/data/learning2014.txt")
Next, let’s have a look at the dimensions and structure of the data set
dim(learning2014)
## [1] 166 7
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.75 2.88 3.88 3.5 3.75 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The data frame has 166 observations of 7 variables. The variables are gender (female or male), age, four attributes (attitude, deep, stra, surf) and the total points.
To examine the variables, let us print a summary of them. This shows overview statistics for each variable.
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :14.00 Min. :1.625 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.500 1st Qu.:2.625
## Median :22.00 Median :32.00 Median :3.875 Median :3.188
## Mean :25.51 Mean :31.43 Mean :3.796 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.250 3rd Qu.:3.625
## Max. :55.00 Max. :50.00 Max. :4.875 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
We will need the following additional libraries to analyse the data graphically:
library(GGally)
library(ggplot2)
To explore the variables graphically and specifically the relationships between them, we will draw an overview plot matrix with ggpairs
p <- ggpairs(learning2014, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
p
The gender distribution is clearly skewed, and so is age (as expected, given that these are presumably university students). Age does not correlate significantly with any of the other variables.
The three highest linear correlations (in absolute value) with the points variable are attitude (0.437), stra (0.146) and surf (-0.144), though the latter two are rather low. surf is somewhat correlated with attitude (-0.176) and stra (-0.161), while the correlation between stra and attitude is low (0.0617).
Based on the observations on correlations, let’s regress points on attitude, stra and surf and print out a summary of the model
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The statistically significant coefficients are the intercept (at 0.01 level) and attitude (at 0.001 level).
As stra and surf were not statistically significant explanatory variables, let’s remove them from the model
my_model2 <- lm(points ~ attitude, data = learning2014)
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
This had a small impact on the remaining coefficients and diminished the R-squared by a small amount (as expected) in comparison to the previous model. Now the intercept and attitude coefficients are both significant at 0.001 level. As the intercept coefficient is 11.6, a person with zero attitude would get 11.6 points according to the model. Thereafter each unit increase in attitude increases the points by about 0.35 according to the model. As per the multiple R-squared diagnostic, this model is capable of explaining about 19% of the variability of points variable around its mean.
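As a quick sanity check on this interpretation, the fitted line can be evaluated by hand. This is only a sketch: the coefficients are rounded from the summary above, and the helper function name is ours, not part of the original analysis.

```r
# Sketch: evaluate the fitted line points = intercept + slope * attitude by hand
# (coefficients rounded from the model summary above; predict_points is our own helper)
intercept <- 11.64
slope <- 0.35

predict_points <- function(attitude) intercept + slope * attitude

predict_points(32)  # the median attitude; about 22.8 points
```

This agrees with the model's interpretation: roughly 11.6 baseline points plus about 0.35 points per unit of attitude.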
Let’s plot this model to aid in diagnostics.
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
Now on to diagnostics of the model.
par(mfrow = c(2,2))
plot(my_model2, c(1, 2, 5))
The residuals seem to be somewhat correlated with the model predictions: visually, the residuals shrink as the predictions grow, apart from a few outliers. This suggests that the constant-variance-of-errors assumption does not quite hold, although on visual inspection the effect is not severe.
Based on the QQ plot, the errors are not exactly normally distributed. The residuals follow the theoretical quantiles closely in the low and mid range (in absolute terms) but deviate from the line at the high end of the distribution. The deviations are, however, comparatively small.
On the residuals vs leverage plot, we see some observations with comparatively high leverage. These are likely the ones with very low attitude and very low points, as they would “pull” the regression line downwards from the low end. However, no observation has a leverage that is high in absolute terms, so we should probably not be too worried about outliers.
Again let’s first load libraries and read data from the local file into a data frame
library(dplyr); library(ggplot2); library(boot)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
alc <- read.table("/Users/lehtint9/IODS-project/data/Ch3_alc.csv", sep=";", header=TRUE)
The data contains responses to a survey of students in two classes. It has demographic and background variables, information about free time and studies, and about alcohol consumption.
It has 35 variables for 382 students.
Variable names are
dim(alc)
## [1] 382 35
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
In this exercise we are interested in the alcohol consumption of students. We’ll start with the hypothesis that four variables are related to alcohol consumption. It seems reasonable that the following variables could have a connection with alcohol consumption:
sex is the student’s sex (binary: ‘F’ - female or ‘M’ - male) (factor variable)
romantic tells if there is a romantic relationship (binary: yes or no) (factor variable)
studytime is the weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
famrel is the quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
g1a <- ggplot(data = alc, aes(x=high_use, col=sex))
g1b <- g1a+geom_bar()+facet_wrap("sex")
g1b
There seems to be proportionally more high alcohol users in males than in females. Let’s look at this in cross-tabulation also. High use is on rows, gender on columns. There are proportionally more male high users.
mytable <- table(alc$high_use, alc$sex)
mytable # print table
##
## F M
## FALSE 156 112
## TRUE 42 72
prop.table(mytable, 2) # column percentages
##
## F M
## FALSE 0.7878788 0.6086957
## TRUE 0.2121212 0.3913043
Let’s look at all the other variables in relation to both high_use and sex.
First romantic relations.
mytable <- table(alc$high_use, alc$romantic)
mytable # print table
##
## no yes
## FALSE 180 88
## TRUE 81 33
prop.table(mytable, 2) # column percentages
##
## no yes
## FALSE 0.6896552 0.7272727
## TRUE 0.3103448 0.2727273
A higher percentage of those who are not in a romantic relationship are high alcohol users. The effect seems small, but we’ll know more once we fit a model. Let’s visualize this too.
g2a <- ggplot(data = alc, aes(x=high_use, col=sex))
g2b <- g2a+geom_bar()+facet_wrap("romantic")+ggtitle("Romantic relation, high alcohol use and gender")
g2b
Next, studytime
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_studytime = mean(studytime))
## Source: local data frame [4 x 4]
## Groups: sex [?]
##
## sex high_use count mean_studytime
## <fctr> <lgl> <int> <dbl>
## 1 F FALSE 156 2.339744
## 2 F TRUE 42 2.000000
## 3 M FALSE 112 1.883929
## 4 M TRUE 72 1.638889
For both males and females, average study times are higher among non-high-users. Males have lower average study times overall than females.
Then family relations or famrel.
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_famrel = mean(famrel))
## Source: local data frame [4 x 4]
## Groups: sex [?]
##
## sex high_use count mean_famrel
## <fctr> <lgl> <int> <dbl>
## 1 F FALSE 156 3.910256
## 2 F TRUE 42 3.761905
## 3 M FALSE 112 4.133929
## 4 M TRUE 72 3.791667
For both males and females, family relations are better on average among non-high-users. For males, the difference in mean family relations between high users and non-high-users is larger.
Let’s fit a logistic model to the high_use data. We are fitting the model to find what factors are correlated with higher alcohol consumption among students.
m <- glm(high_use ~ sex + romantic + studytime + famrel, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ sex + romantic + studytime + famrel,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4916 -0.8515 -0.6620 1.2041 2.1353
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.8894 0.6030 1.475 0.14022
## sexM 0.7199 0.2448 2.941 0.00327 **
## romanticyes -0.1301 0.2551 -0.510 0.61018
## studytime -0.4631 0.1562 -2.965 0.00302 **
## famrel -0.3022 0.1262 -2.395 0.01662 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 434.86 on 377 degrees of freedom
## AIC: 444.86
##
## Number of Fisher Scoring iterations: 4
The model summary shows that two of the variables are significant at the 0.01 level: sex and studytime. Family relations (famrel) are significant at the 0.05 level, while romantic did not have a statistically significant effect. Being male correlated positively with high alcohol use; studytime and famrel correlated negatively, i.e. more studying and better family relations were associated with lower alcohol consumption. The directions of the statistically significant (0.01 or 0.05 level) effects were as initially hypothesized.
Below, the results of fitting the model are presented as odds ratios: a unit change in an explanatory variable (vs. no change) multiplies the odds of high alcohol use by the corresponding odds ratio. Confidence intervals are also printed out.
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 2.4336272 0.7466719 8.0045769
## sexM 2.0541668 1.2749343 3.3342592
## romanticyes 0.8780501 0.5285733 1.4404187
## studytime 0.6293367 0.4591485 0.8484560
## famrel 0.7391943 0.5763518 0.9468467
According to the model, males have an odds ratio of about 2, meaning that the odds of a male being a high consumer of alcohol are about twice those of a female.
Romantic relations were not a statistically significant explanatory variable. In the odds ratios we also see that the 95% confidence interval spans from 0.53 to 1.44, which indicates an ambiguous relationship to alcohol consumption.
A unit increase in the studytime variable multiplies the odds of high alcohol use by about 0.63, and a unit increase in family relations (famrel) by about 0.74. Both therefore lower the odds of high_use, but the practical interpretation of the effect is difficult due to the categorical nature of these variables. In particular, studytime is not in hours but is categorised into non-uniform hour intervals, so this result should be interpreted in terms of jumps from one category to another.
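To make the odds ratios concrete, here is a small sketch of how an odds ratio shifts a probability. The baseline probability of 0.25 is hypothetical (chosen for illustration, not taken from the data); only the odds ratio for sexM comes from the model above.

```r
# Hypothetical example: how the fitted odds ratio for sexM (~2.05) shifts
# a baseline probability of high alcohol use. The baseline 0.25 is made up.
or_male <- 2.05

prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(o) o / (1 + o)

p_female <- 0.25   # hypothetical baseline probability
p_male <- odds_to_prob(prob_to_odds(p_female) * or_male)
round(p_male, 2)   # about 0.41 -- doubling the odds is not doubling the probability
```

This also illustrates why odds ratios should not be read directly as relative probabilities.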
Let’s now refit the model, leaving out romantic relations - as important as they are in real life, they are not statistically significant in this case.
m2 <- glm(high_use ~ sex + studytime + famrel, data = alc, family = "binomial")
summary(m2)
##
## Call:
## glm(formula = high_use ~ sex + studytime + famrel, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.530 -0.854 -0.666 1.218 2.157
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.8395 0.5955 1.410 0.15861
## sexM 0.7257 0.2444 2.970 0.00298 **
## studytime -0.4679 0.1563 -2.994 0.00275 **
## famrel -0.2982 0.1259 -2.368 0.01786 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 435.12 on 378 degrees of freedom
## AIC: 443.12
##
## Number of Fisher Scoring iterations: 4
The summary statistics look as expected. famrel is significant only at the 0.05 level, but let’s keep it in the model anyway.
Now we’ll predict high use with this new model. We know the actual values of high_use, so we can compare them to the predictions. Ideally there would be nothing in the off-diagonal cells of the confusion matrix:
model_prob <- predict(m2, type = "response")
alc <- mutate(alc, high_use_prob = model_prob)
alc <- mutate(alc, high_use_pred = high_use_prob>0.5)
table(high_use = alc$high_use, prediction = alc$high_use_pred)
## prediction
## high_use FALSE TRUE
## FALSE 255 13
## TRUE 102 12
Quite a few false negatives in the predictions. Let’s visualize the results a bit more.
g <- ggplot(alc, aes(x = high_use_prob, y = high_use, col=high_use_pred))
g+geom_point()
table(high_use = alc$high_use, high_use_pred = alc$high_use_pred) %>% prop.table() %>% addmargins()
## high_use_pred
## high_use FALSE TRUE Sum
## FALSE 0.66753927 0.03403141 0.70157068
## TRUE 0.26701571 0.03141361 0.29842932
## Sum 0.93455497 0.06544503 1.00000000
which basically tells the same story: the model predicts too many negatives. The training error is high at about 0.30. Not a good prediction! Very bad. Sad!
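The training error can be read straight off the confusion matrix: it is the share of observations in the off-diagonal cells. A small check with the counts copied from the table above:

```r
# Training error = share of misclassified observations (off-diagonal cells),
# counts copied from the confusion matrix above
conf <- matrix(c(255, 102, 13, 12), nrow = 2,
               dimnames = list(high_use = c("FALSE", "TRUE"),
                               prediction = c("FALSE", "TRUE")))
training_error <- 1 - sum(diag(conf)) / sum(conf)
round(training_error, 2)   # 0.3
```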
To quantify this a bit, we’ll define a loss function and use it to compare predictions to simple guesswork.
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
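A toy check of what the loss function does: TRUE/FALSE coerce to 1/0 in the subtraction, so a prediction counts as wrong when it lands on the wrong side of 0.5.

```r
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Two of four predictions are on the wrong side of 0.5:
loss_func(class = c(TRUE, TRUE, FALSE, FALSE),
          prob  = c(0.9, 0.2, 0.6, 0.1))   # 0.5
```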
Now the guesswork: let’s guess that there are no high users of alcohol. The loss function is
loss_func(class = alc$high_use, prob = 0)
## [1] 0.2984293
And then let’s use the actual predictions. Now the loss function is
loss_func(class = alc$high_use, prob = alc$high_use_prob)
## [1] 0.3010471
Using this measure, the guesswork is actually a little bit better! So the model was actually really sad.
Let’s perform 10-fold cross validation on the model and print out the loss function for it.
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)
cv$delta[1]
## [1] 0.3062827
So this model performed worse than the model in the DataCamp exercise: loss function .306 versus .26.
What would be a better model? One way to approach this is to start from the model in the DataCamp exercise (predictors failures, absences and sex) and add a predictor that increases the predictive power of the model. In this exercise, the “best” variable used as a predictor was studytime. Let’s add that to the model, perform the cross-validation and print out the loss function.
m3 <- glm(high_use ~ sex + failures + absences + studytime, data = alc, family = "binomial")
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m3, K = 10)
cv$delta[1]
## [1] 0.2643979
And voila, the cross-validated loss (.264) is now on par with the DataCamp exercise (.26)!
In this Chapter, we will be using a dataset on suburbs of Boston. We are aiming to employ classification and clustering on the dataset. In classification, we know the classes, and aim to fit a model that can classify observations into these classes based on other variables. We will classify the suburbs according to their crime rate. In clustering, we are trying to find similar groups or clusters of observations without prior knowledge of the clusters.
Let’s first load the libraries needed in this Chapter.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(dplyr)
library(corrplot)
library(NbClust)
The dataset is included in the MASS package. Let’s load the Boston dataset and explore its contents.
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The dataset has 14 variables and 506 observations. The variables describe characteristics of 506 suburbs of Boston, including the crime rate crim that we will use for classification, and 13 other variables.
We are most interested in the crim variable, which describes the per capita crime rate of the area. The crime rate is a continuous variable that varies greatly between areas: the maximum crime rate is very high compared to the median. Looking at the quartiles, this is partly due to outliers, as the 3rd quartile is much smaller than the maximum value. Note also that the 3rd quartile is already very high compared to the minimum.
Also some of the other variables have uneven distributions: e.g. black (scaled proportion of blacks), indus (proportion of non-retail business acres), age (proportion of owner-occupied units built prior to 1940) and lstat (lower status of the population (percent)).
To explore the relations between the variables of the dataset, let’s print out pairwise scatter plots and a correlation plot.
pairs(Boston)
cor_matrix<-cor(Boston) %>% round(2)
corrplot.mixed(cor_matrix)
By visual inspection, the crime rate is correlated with many of the variables. Since the crime rate (like several other variables) is highly skewed, the correlations are not easy to read from the scatter plots. From the correlation plot, however, crime rate is negatively correlated with e.g. housing values and distance to employment centres, and positively correlated with e.g. accessibility to radial highways (rad) and property tax rate (tax). High correlations among variables other than crim are also found.
LDA assumes that the variables are normally distributed and share the same variance. Therefore, we’ll next standardize the dataset by subtracting the mean of each variable from its observations and dividing by the variable’s standard deviation.
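The standardization can be written out by hand; a minimal check (on a toy vector, not the Boston data) that scale() is equivalent to subtracting the mean and dividing by the standard deviation:

```r
# scale() standardizes column by column: (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), manual)   # TRUE
```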
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
The above summary of the scaled dataset shows that each variable now has a mean of 0, that is, its distribution is centered around zero.
In order to use the crime rate in classification, it needs to be a categorical variable. Let’s replace the crim variable with a categorical version, crime, by cutting crim at its quartiles. The crime categories are labeled low, med_low, med_high and high.
scaled_crim <- boston_scaled$crim
bins <- quantile(scaled_crim)
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
We’ll use an LDA model to classify the suburbs into crime rate classes. First we divide the dataset into a training set and a test set: we fit the classifier on the training set and then use the test set to see how well the classification performs on new data. We do this by randomly picking 80% of the observations for the training set and leaving the rest as the test set. The test set does not include the crime classes; these are stored separately.
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
Now we’ll use LDA to fit a model that classifies the training dataset into the crime classes. We make use of the lda.arrows function as presented in the DataCamp exercise in order to print a biplot, that is, a scatter plot of the observations in the space of the first two linear discriminants, with arrows showing the variable loadings.
lda.fit <- lda(crime ~ ., data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col=classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
Now we’ll predict the crime rates of the test dataset based on the model fitted on the training dataset.
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 19 5 0 0
## med_low 12 21 2 0
## med_high 0 11 11 2
## high 0 0 0 19
Most of the predictions are on the diagonal of the cross tabulation. The prediction error was about 31% (32 of the 102 test observations misclassified). This is probably an OK result.
With K-means clustering, we aim to find clusters of similar observations in the data without prior knowledge of these clusters. For clustering, we’ll use the scaled Boston dataset, so that the distances are comparable. The following prints out summaries of the Euclidean and Manhattan distance matrices of the scaled Boston dataset. Euclidean distance is the geometric distance; Manhattan distance is measured along the axes.
set.seed(123)
boston_clusters <- as.data.frame(scale(Boston))
dist_eu <- dist(boston_clusters)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4620 4.8240 4.9110 6.1860 14.4000
dist_man <- dist(boston_clusters, method="manhattan")
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4830 12.6100 13.5500 17.7600 48.8600
Not surprisingly, Manhattan distances are higher.
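The two metrics can be checked on a toy pair of points; for any pair, the Manhattan distance is at least as large as the Euclidean one:

```r
# Euclidean vs Manhattan distance for the points (0, 0) and (3, 4)
a <- c(0, 0)
b <- c(3, 4)
sqrt(sum((a - b)^2))   # Euclidean: 5
sum(abs(a - b))        # Manhattan: 7
```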
Let’s perform K-means clustering with Euclidean distances and K=3 clusters
km <-kmeans(dist_eu, centers = 3)
pairs(Boston, col = km$cluster)
To find the optimal number of clusters, we’ll explore the total within-cluster sum of squares (twcss) with the number of clusters ranging from 1 to 10.
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, twcss, type='b')
The twcss curve is interpreted so that the optimal number of clusters is where twcss drops radically. Here it obviously drops radically when K changes from 1 to 2. The drop from 2 to 3 might also be counted as radical (the exact shape depends on the locations of the initial random cluster centers, but with set.seed(123) as above this drop was about 25% of twcss). We could stick with K=2 as the optimal, though this is debatable.
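The “radical drop” rule can be made slightly more mechanical. A sketch with a made-up decreasing twcss sequence (hypothetical values, not the ones from the plot above):

```r
# Crude elbow rule: pick the K that follows the largest relative drop in twcss
twcss <- c(100, 45, 34, 30, 28, 27)          # hypothetical twcss for K = 1..6
rel_drop <- -diff(twcss) / head(twcss, -1)   # relative drop at each step
which.max(rel_drop) + 1                      # suggests K = 2
```

This is only a heuristic; index-based tests such as those in NbClust (used below) are more principled.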
So how to determine the number of clusters? Let’s run a set of tests to see if we can find some consensus. Used in this way, NbClust performs a number of tests and reports the best number of clusters based on majority rule.
nb_clusters <- NbClust(data = data.matrix(boston_scaled), distance = "euclidean", min.nc = 2, max.nc = 15, method = "kmeans", index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 9 proposed 2 as the best number of clusters
## * 3 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 3 proposed 6 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
## * 3 proposed 11 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 2 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
So let’s use K=2 as the optimal number of clusters.
km <-kmeans(dist_eu, centers = 2)
pairs(Boston, col = km$cluster)
With K=2 clusters, we can see from the pairwise scatter plots that some of the variables are clearly divided between the clusters (in the sense of these pairwise plots), some are not.
The crime rate, for example, is clearly divided: one of the clusters includes only low crime rates, while the other includes both low and high crime rates. Some of the other variables are clearly dichotomous when plotted against others - for example, rad and tax show such a tendency.
For this exercise we will use the “human” dataset from the United Nations Development Programme (UNDP). The dataset has countries as row names and includes 155 observations of 8 variables.
human <- read.csv("/Users/lehtint9/IODS-project/data/human.csv",row.names = 1, sep=",", header=TRUE)
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
The eight variables describe the health, knowledge and empowerment conditions in the countries.
Edu2.FM Proportion of females with at least secondary education divided by the proportion of males with at least secondary education
Labo.FM Proportion of females in the labour force divided by the proportion of males in the labour force
Edu.Exp Expected years of schooling
Life.Exp Life expectancy at birth
GNI Gross National Income per capita
Mat.Mor Maternal mortality ratio
Ado.Birth Adolescent birth rate
Parli.F Percentage of female representatives in parliament
library(GGally)
library(ggplot2)
library(dplyr)
library(corrplot)
ggpairs(human)
cor(human) %>% corrplot.mixed()
From the plots, it is clear that there are rather strong correlations between six of the variables in the dataset. The strongest positive correlations are found between expected years of schooling and life expectancy (Edu.Exp and Life.Exp), as well as between maternal mortality and adolescent birth rate (Mat.Mor and Ado.Birth). The strongest negative correlations are found between these two groups, e.g. between life expectancy and maternal mortality. Interestingly, two variables (Labo.FM and Parli.F) correlate only weakly with all of the other variables.
PCA transforms the dataset into a new space. The dimensions of this new space are called the principal components of the data. The first principal component captures the maximum amount of variance in the original features. Each successive principal component is orthogonal to the previous ones and captures the maximum amount of the remaining variance.
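The idea can be illustrated on toy data (simulated here, not the human dataset): when two variables are strongly linearly related, PC1 alone captures nearly all of the variance.

```r
# Sketch: PCA on simulated, strongly correlated 2-D data
set.seed(1)
x <- rnorm(100)
toy <- data.frame(x = x, y = 2 * x + rnorm(100, sd = 0.1))
pca <- prcomp(toy)
summary(pca)$importance["Proportion of Variance", ]   # PC1 close to 1
```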
First we use the dataset as is and perform principal component analysis of it.
pca_human <- prcomp(human)
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
As we already see from the model summary, the first principal component PC1 captures almost all of the variance in the data by itself. Let’s plot the biplot to inspect the model visually.
biplot(pca_human, choices = 1:2, cex=c(0.7, 1), col = c("grey40", "deeppink2"))
We see from the biplot that the first principal component, PC1, is practically identical to the GNI feature. The GNI arrow is much longer than the other arrows, reflecting its much larger standard deviation: the variance of the GNI variable is huge compared to the variances of the other variables. As a result, PCA ends up explaining the variance of the data almost entirely through the variance of GNI.
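This effect is easy to reproduce on synthetic data: when one column's scale dwarfs the others, unscaled PCA essentially returns that column as PC1. A small sketch using made-up data (not the human dataset):

```r
set.seed(1)
# Two standard-normal columns plus one column on a vastly larger scale,
# mimicking GNI next to the other human variables
x <- data.frame(a = rnorm(100), b = rnorm(100), big = rnorm(100, sd = 1000))

# Unscaled PCA: PC1's loading on 'big' is close to +/-1,
# and PC1 alone captures essentially all of the variance
pca_raw <- prcomp(x)
pca_raw$rotation[, "PC1"]
summary(pca_raw)$importance["Proportion of Variance", "PC1"]

# After standardising, the variables contribute on an equal footing
pca_std <- prcomp(x, scale. = TRUE)
summary(pca_std)$importance["Proportion of Variance", "PC1"]
```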
Since PCA on the original data did not provide very useful results, we'll standardise the dataset so that the variances of the variables are equal.
human_std <- scale(human)
pca_human_std <- prcomp(human_std)
biplot(pca_human_std, choices = 1:2, cex=c(0.7, 1), col = c("grey40", "deeppink2"))
These PCA results are clearly different from the results on the non-standardised dataset: GNI no longer dominates the variance, and the results make much more “sense”.
Interpreting the results of PCA from the biplot, we see that
Low values of PC1 describe countries with high expected years of schooling, high life expectancy, high female education compared to male education, and high GNI. Simultaneously, high values of PC1 describe countries with high maternal mortality ratio and high adolescent birth rate. The PC1 feature, then, could be interpreted to describe “inverse of standard of living”.
High values of PC2 describe countries with a high proportion of females in the labour force compared to males, and a high percentage of females in parliament. The PC2 feature, then, could be interpreted to describe “gender equality in society”.
These interpretations of PC1 and PC2 are, of course, rough descriptions of much more nuanced societal phenomena.
For the remaining part of the exercise we’ll use the tea dataset from FactoMineR.
library(FactoMineR)
data(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
The tea data has 300 observations of 36 variables. We’ll pick some variables that seem interesting (places to drink tea, gender, and age groups) and visualise only these variables.
library(tidyr)
keep <- c("home", "work", "tearoom", "resto", "pub", "age_Q", "sex")
tea_ <- dplyr::select(tea, one_of(keep))
gather(tea_) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables; they will
## be dropped
We’ll also perform MCA for these variables.
tea_mca <- MCA(tea_, graph = FALSE)
summary(tea_mca)
##
## Call:
## MCA(X = tea_, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.223 0.198 0.168 0.159 0.151 0.128
## % of var. 15.607 13.877 11.736 11.152 10.600 8.933
## Cumulative % of var. 15.607 29.484 41.220 52.372 62.971 71.905
## Dim.7 Dim.8 Dim.9 Dim.10
## Variance 0.115 0.103 0.097 0.087
## % of var. 8.028 7.198 6.769 6.100
## Cumulative % of var. 79.933 87.131 93.900 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | -0.403 0.243 0.123 | 0.128 0.028 0.012 | -0.319 0.203
## 2 | -0.219 0.071 0.057 | -0.250 0.105 0.074 | -0.127 0.032
## 3 | 0.509 0.388 0.175 | 0.262 0.116 0.046 | -0.145 0.042
## 4 | -0.545 0.443 0.413 | -0.172 0.050 0.041 | -0.167 0.055
## 5 | -0.498 0.370 0.259 | 0.094 0.015 0.009 | -0.049 0.005
## 6 | -0.545 0.443 0.413 | -0.172 0.050 0.041 | -0.167 0.055
## 7 | -0.403 0.243 0.123 | 0.128 0.028 0.012 | -0.319 0.203
## 8 | -0.124 0.023 0.013 | -0.216 0.079 0.039 | -0.398 0.315
## 9 | -0.403 0.243 0.123 | 0.128 0.028 0.012 | -0.319 0.203
## 10 | 0.081 0.010 0.003 | 0.066 0.007 0.002 | -0.027 0.001
## cos2
## 1 0.077 |
## 2 0.019 |
## 3 0.014 |
## 4 0.039 |
## 5 0.003 |
## 6 0.039 |
## 7 0.077 |
## 8 0.131 |
## 9 0.077 |
## 10 0.000 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## home | -0.006 0.002 0.001 -0.604 | -0.072 0.360 0.167
## Not.home | 0.199 0.076 0.001 0.604 | 2.321 11.649 0.167
## Not.work | -0.340 5.249 0.282 -9.190 | -0.260 3.454 0.165
## work | 0.832 12.851 0.282 9.190 | 0.636 8.455 0.165
## Not.tearoom | -0.309 4.941 0.399 -10.920 | 0.037 0.081 0.006
## tearoom | 1.290 20.615 0.399 10.920 | -0.156 0.338 0.006
## Not.resto | -0.325 4.993 0.296 -9.406 | -0.185 1.813 0.096
## resto | 0.910 13.968 0.296 9.406 | 0.517 5.072 0.096
## Not.pub | -0.283 4.058 0.302 -9.495 | 0.106 0.644 0.043
## pub | 1.065 15.264 0.302 9.495 | -0.400 2.423 0.043
## v.test Dim.3 ctr cos2 v.test
## home -7.059 | -0.039 0.126 0.049 -3.847 |
## Not.home 7.059 | 1.265 4.090 0.049 3.847 |
## Not.work -7.029 | 0.280 4.739 0.192 7.572 |
## work 7.029 | -0.685 11.602 0.192 -7.572 |
## Not.tearoom 1.318 | -0.162 1.805 0.110 -5.724 |
## tearoom -1.318 | 0.676 7.532 0.110 5.724 |
## Not.resto -5.345 | -0.241 3.647 0.163 -6.972 |
## resto 5.345 | 0.674 10.203 0.163 6.972 |
## Not.pub 3.567 | -0.034 0.076 0.004 -1.128 |
## pub -3.567 | 0.126 0.286 0.004 1.128 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## home | 0.001 0.167 0.049 |
## work | 0.282 0.165 0.192 |
## tearoom | 0.399 0.006 0.110 |
## resto | 0.296 0.096 0.163 |
## pub | 0.302 0.043 0.004 |
## age_Q | 0.076 0.633 0.644 |
## sex | 0.205 0.278 0.012 |
plot(tea_mca, invisible=c("ind"), habillage = "quali")
The first two dimensions of MCA explain about 30% of variance.
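The percentages quoted above can also be read programmatically from the eigenvalue table of the fitted MCA object (this assumes the tea_mca object created earlier; the column names are those used by FactoMineR):

```r
# Eigenvalue table: variance, % of variance and cumulative % per dimension
head(tea_mca$eig)

# Cumulative variance explained by the first two dimensions (about 29.5%)
tea_mca$eig[2, "cumulative percentage of variance"]
```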
The biplot for variables allows for some interpretation of tea drinking habits: category labels plotted close to each other tend to co-occur in the responses.
plot(tea_mca, invisible=c("var"), habillage = "quali")
The biplot for individuals is mostly hard to interpret. We mainly see that there are a few outlier individuals, which might indicate that the MCA result depends to some extent on these outliers.